Computer implemented determination method and system

ABSTRACT

Methods for providing a computer implemented medical diagnosis are provided. In one aspect, a method includes receiving an input from a user comprising at least one symptom of the user, and providing the at least one symptom as an input to a medical model. The method also includes deriving estimates of the probability of the user having a disease from the discriminative model, inputting the estimates to the inference engine, performing approximate inference on the probabilistic graphical model to obtain a prediction of the probability that the user has that disease, and outputting the probability of the user having the disease for display by a display device.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of priority under 35 U.S.C. § 120 toand is a continuation of U.S. patent application Ser. No. 16/352,681,which application is the U.S. national stage under 35 U.S.C. § 371 ofInternational Application Number PCT/GB2018/053154, having aninternational filing date of Oct. 31, 2018, which claims benefit ofpriority to Great Britain Patent Application Number 1718003.5, filedOct. 31, 2017, and to Great Britain Patent Application Number 1818158.0,filed Sep. 27, 2018, all of which applications are incorporated hereinby reference.

This application is also related to U.S. patent application Ser. No.16/277,975 filed concurrently on Feb. 15, 2019; and is also related toU.S. patent application Ser. No. 16/277,970 filed concurrently on Feb.15, 2019.

FIELD

Embodiments of the present invention relate to the field of computerimplemented determination methods and systems.

BACKGROUND

Graphical models provide a natural framework for expressing theprobabilistic relationships between random variables in numerous fieldsacross the natural sciences. Bayesian networks, a directed form ofgraphical model, have been used extensively in medicine, to capturecausal relationships between entities such as risk-factors, diseases andsymptoms, and to facilitate medical decision-making tasks such asdisease diagnosis. Key to decision-making is the process of performingprobabilistic inference to update one's prior beliefs about thelikelihood of a set of diseases, based on the observation of newevidence.

BRIEF DESCRIPTION OF FIGURES

FIG. 1 is an overview of a system in accordance with an embodiment;

FIG. 2 is a schematic diagram of a simple graphical model;

FIG. 3 is a flow diagram showing the training of a discriminative modelto use the system of FIG. 1;

FIG. 4 is a flow diagram showing the use of the trained model with theinference engine of FIG. 1;

FIG. 5 is a basic schematic processing system with a GPU;

FIG. 6 is a schematic of an overview of a system in accordance with anembodiment;

FIG. 7 is a flow diagram showing the training of a discriminative modelto use the system of FIG. 6;

FIG. 8 is a flow diagram showing the use of the trained model with theinference engine of FIG. 6;

FIG. 9(a) is a schematic of a graphical Model and FIG. 9(b) thecorresponding UM architecture. The nodes of (a) the graph arecategorized by their depth inside the network and the weights of (b) theUM neural network are shared for nodes of the same category;

FIGS. 10a-10c shows the performance of the above system on threedifferent graphical models. FIG. 10(a) shows results from a syntheticgraph with 96 nodes, FIG. 10(b) shows results from a synthetic graphwith 768 nodes and FIG. 10(c) shows results from a medical PGM.Inference was applied through importance sampling with and without thesupport of a trained UM and it was evaluated in terms of PearsonCorrelation Coefficient (PCC), Mean Absolute Error (MAE) and EffectiveSampling Size (ESS); and

FIGS. 11a-11b shows the embeddings filtered for two sets of symptoms andrisk factors, where each scatter point corresponds to a set of evidence.FIG. 11(a) shows results for Diabetes embeddings and FIG. 11(b) showsresults for smoke and obesity embeddings. The display embedding vectorscorrespond to the first two components. It can be seen that theyseparate quite well unrelated medical concepts and show an overlap forconcepts which are closely related.

DETAILED DESCRIPTION

In an embodiment, a method for providing a computer implemented medicaldiagnosis is provided, the method comprising:

receiving an input from a user comprising at least one symptom;

providing at least one symptom as an input to a medical model;

using the medical model to determine the probability of the user havinga disease stored in the medical model from the provided input; and

outputting the probability of the user having one or more diseases,

wherein said medical model comprises a probabilistic graphical modelcontaining the probability distributions and the relationships betweensymptoms and diseases, an inference engine configured to performBayesian inference on said probabilistic graphical model using adiscriminative model, wherein the discriminative model has beenpre-trained to approximate the probabilistic graphical model, thediscriminative model being trained using samples generated from saidprobabilistic graphical model, wherein some of the data of the sampleshas been masked to allow the discriminative model to produce data whichis robust to the user providing incomplete information about theirsymptoms,

and wherein determining the probability that the user has a diseasecomprises deriving estimates of the probabilities that the user has thatdisease from the discriminative model, inputting these estimates to theinference engine and performing approximate inference on theprobabilistic graphical model to obtain a prediction of the probabilitythat the user has that disease.

Medical diagnosis systems require significant computing resources suchas processor capacity. The disclosed systems and methods solve thistechnical problem with a technical solution, namely by conductingapproximate statistical inference on a PGM with help of a discriminativemodel (e.g. a neural net) to provide an estimate of the posteriorprobabilities. The discriminative model is trained such that it isrobust to the user providing incomplete information about theirsymptoms. The above therefore allows the system to produce answers usingsuch new approximate inference with the accuracy comparable to usingexact or already existing approximate inference techniques, but in afraction of the time and with a reduction in the processing required.

The inference engine may be configured to perform importance samplingover conditional marginal. However, other methods may be used such asVariational Inference, other Monte Carlo methods, etc.

The discriminative model can be a neural network. In some embodiments,the neural network is a single neural network in other embodiments theneural network is as described in Further Embodiment A. The neural netcan approximate the outputs of the probabilistic graphical model andhence later in this document, it is termed a Universal Marginaliser(UM). In an embodiment, the probabilistic graphical model is a noisy-ORmodel.

As, performing probabilistic inference is computationally expensive, andin a medicine where large-scale Bayesian networks are required to makeclinically robust diagnoses, it is not feasible to apply exact inferencetechniques. Instead, approximate, sampling-based algorithms are usedwhich provide theoretical guarantees (under the central limit theorem)regarding convergence to the true posterior. In the context of medicaldiagnosis, this amounts to arriving at the true disease differential,based on the evidence and the underlying the model.

As the true ‘target’ (posterior) distribution is unknown ahead of time,the task of inference is to sample from an independent ‘proposal’distribution, which, ideally is as close as possible to the target. Thestandard approach when applying Bayesian networks for medicaldecision-making, is to use the model prior as the proposal distribution.However, this is often not ideal, particularly in cases where an unusualcombination of symptoms is generated by a rare disease. In these andsimilar cases, a large number of samples is often required to reduce thevariance in the estimate of the true posterior; this poses a significantpractical constraint to the use sampling algorithms for inference.

As a consequence, in all but the simplest of symptom presentations, itis often difficult to match the diagnostic speed of human doctors. Thisis because, for cognitive tasks, humans operate in the setting ofamortized inference i.e. ‘they have to solve many similar inferenceproblems, and can thus offload part of the computational work to sharedpre-computation and adaptation over time’. As discussed above, where theproposal is close to posterior, fewer samples need to be drawn, andtherefore inference will be more rapid.

FIG. 1 is a schematic of a method in accordance with an embodiment. Apatient 101 inputs their symptoms in step S103 via interface 105. Thepatient may also input their risk factors, for example, whether they area smoker, their weight, etc. The interface may be adapted to ask thepatient 101 specific questions. Alternately, the patient may just simplyenter free text. The patient's risk factors may be derived from thepatient's records held in a database (not shown). Therefore, once thepatient identified themselves, data about the patient could be accessedvia the system.

In further embodiments, follow-up questions may be asked by theinterface 105. How this is achieved will be explained later. First, itwill be assumed that the patient provide all possible information(evidence) to the system at the start of the process. This will be usedto explain the basic procedure. However, a variation on the procedurewill then be explained with patient only gives partial information withthe system, once completing the first analysis, requests furtherinformation.

The evidence will be taken to be the presence or absence of all knownsymptoms and risk factors. For symptoms and risk factors where thepatient has been unable to provide a response, these will assume to beunknown.

Next, this evidence is passed in step S107 to the inference engine 109.Inference engine 109 performs Bayesian inference on PGM 120. PGM 120will be described in more detail with reference to FIG. 2 after thediscussion of FIG. 1.

Due to the size of the PGM 120, it is not possible to perform exactinference using inference engine 109 in a realistic timescale.Therefore, the inference engine 109 performs approximate inference. Inan embodiment, the inference engine 109 is configured to performImportance Sampling. Importance sampling is described with reference toEquation 3 below.

When performing approximate inference, the inference engine 109 requiresan approximation of the probability distributions within the PGM to actas proposals for the sampling. In step S111, the evidence is passed towhat would be termed a universal marginaliser (UM) 113. The UM will bedescribed in more detail with reference to both FIGS. 3 and 4. Insummary, the UM is a neural network that has been trained to approximatethe outputs of the PGM 120.

The training of the UM will be described in detail with reference toFIG. 3. However, the UM is a model that can approximate the behaviour ofthe entire PGM 120. In one embodiment the UM is a single neural net, inanother embodiment, the model is a neural network which consists ofseveral sub-networks, such that the whole architecture is a form ofauto-encoder-like model but with multiple branches. Hence, why“universal” is used in the title of the UM. Further, the UM as will bedescribed with reference to FIG. 3 is trained to be robust to thepatient giving incomplete answers. This is achieved via the maskingprocedure for training the UM that will be described with reference toFIG. 3.

In step S115, the UM returns probabilities to be used as proposals tothe inference engine 109. The inference engine 109 then performsimportance sampling using the proposals from the UM as estimates and thePGM 120.

The inference engine 109 calculates “likelihood” (conditional marginalprobability) P(Disease_i|Evidence) for all diseases.

In addition the inference engine can also determine:

P(Symptom_i|Evidence),

P(Risk factor_i|Evidence)

From this, it can transmit back information in step S117 concerning the“likelihood” of a disease given the evidence supplied by the patient 101to the interface 105. The interface 105 can then supply this informationback to the patient in step S119.

The above high-level explanation of the system presumes that the patientprovide all possible evidence they can concerning their symptoms andthat the system has access to all possible risk factors of which thepatient can give a definitive answer. However, in many situations, thepatient will only give a fraction of this information as a first inputinto the system. For example, if the patient has a stomach ache, thepatient is likely to indicate that they have a stomach ache, butprobably little further information without prompting.

In a further embodiment, the system determines whether furtherinformation is required from the patient 101. As explained above, theinference engine 109 determines:

P(Disease_i|Evidence) for all diseases

-   -   P(Symptom_i|Evidence),    -   P(Risk factor_i|Evidence)

It is possible using a value of information analysis (VoI) to determinefrom the above likelihoods whether asking a further question wouldimprove the probability of diagnosis. For example, if the initial outputof the system seems that there are 9 diseases each having a 10%likelihood based on the evidence, then asking a further question willallow a more precise and useful diagnosis to be made. In an embodiment,the next further questions to be asked are determined on the basis ofquestions that reduce the entropy of the system most effectively.

In one embodiment, the analysis to determine whether a further questionshould be asked and what that question should be is based purely on theoutput of the UM 113 that provide an estimate of the probabilities.However, in a further embodiment, the probabilities derive directly fromthe PGM via importance sampling using the UM used to make this decision.

Once the user supplies further information, then this is then passedback and forth to the inference engine 109 to update evidence to produceupdated probabilities. FIG. 2 is a depiction of a graphical model of thetype used in the system of FIG. 1.

The graphical model provides a natural framework for expressingprobabilistic relationships between random variables, to facilitatecausal modelling and decision making. In the model of FIG. 1, whenapplied to diagnosis, D stands for diagnosis, S for symptom and RF forRisk Factor. Three layers: risk factors, diseases and symptoms. Riskfactors causes (with some probability) influence other risk factors anddiseases, diseases causes (again, with some probability) other diseasesand symptoms. There are prior probabilities and conditional marginalsthat describe the “strength” (probability) of connections. We usenoisy-OR and noisy-MAX modelling assumptions for now.

In this simplified specific example, the model is used in the field ofdiagnosis. In the first layer, there are three nodes S₁, S₂ and S₃, inthe second layer there are three nodes D₁, D₂ and D₃ and in the thirdlayer, there are two nodes RF₁, RF₂ and RF₃.

In the graphical model of FIG. 1, each arrow indicates a dependency. Forexample, D₁ depends on RF₁ and RF₂. D₂ depends on RF₂, RF₃ and D₁.Further relationships are possible. In the graphical model shown, eachnode is only dependent on a node or nodes from a different layer.However, nodes may be dependent on other nodes within the same layer.

In an embodiment, the graphical model of FIG. 1 is a Bayesian Network.In this Bayesian Network, the network represents a set of randomvariables and their conditional dependencies via a directed acyclicgraph. Thus, in the network of FIG. 2, given full (or partial) evidenceover symptoms S₁, S₂ and S₃ and risk factors RF₁, RF₂ and RF₃ thenetwork can be used to represent the probabilities of various diseasesD₁, D₂, and D₃.

The BN allows probabilistic inference to update one's beliefs about thelikelihood of a set of events, based on observed evidence. However,performing inference on large-scale graphical models is computationallyexpensive. To reduce the computational task, approximate inferencetechniques are used variational inference or Monte Carlo methods.

In summary

-   -   i. FIG. 2 represents a generative model P(RF,D,S)=P(RF) P(D|RF)        P(S|D).    -   ii. For simplicity for this explanation, it will be assumed that        all nodes are binary.    -   iii. A discriminative model UM is trained (e.g. feedforward        neural network) by sampling from the generative model.    -   iv. Each sample (i.e. combined vector (RF, D, S) becomes a        training example.        -   1. For the input of one training example, values are            “obscured” for each element of the sample vector with some            probability. The probability can be different depending on            the value of that vector element; if that is the case, the            cross-entropy loss should be weighted appropriately by the            probability.        -   2. The output contains exactly the sample without any            obscuring. Each element of the sample vector is a separate            independent output node.    -   v. The loss function is the cross-entropy for each output node.    -   vi. Since the cross-entropy is used, the discriminative model is        expected to        learn exactly conditional marginal P (node|partially_obscured        (RF, D, S)),        where node can be any risk factor, disease or symptom.    -   vii. This allows the use of the trained discriminative model to        either directly        approximate the posterior or use that approximation as a        proposal for any inference method (e.g. as a proposal for Monte        Carlo methods or as a starting point for variational inference,        etc).    -   viii. This conditional marginal approximation can then be used        to sample from the joint of the distribution by iterating node        by node with less and less risk factors and symptoms obscured.

In the method taught herein, it is possible to provide theoreticalguarantees (convergence to the true posterior) regarding the outcomesfrom inference. This is of particular use when the system is applied todecision making of a sensitive nature e.g. medicine or finance.

In an embodiment, since the true posterior (target distribution) isunknown, the task of inference is to sample from an independent‘proposal’ distribution, which ideally, is as close to the targetdistribution as possible.

When performing inference on Bayesian networks, a prior is often used asthe proposal distribution. However, in cases where the BN is used tomodel rare events, a large number of samples are required to reduce thevariance in an estimate of the posterior.

In an embodiment, inference is performed by considering the set ofrandom variables, X={X₁, . . . X_(N)}. A BN is a combination of adirected acyclic graph (DAG), with X_(i) as nodes, and a jointdistribution of the X_(i), P. The distribution P can factorize accordingto the structure of the DAG,

$\begin{matrix}{{P\left( {{X_{1}\mspace{14mu} \ldots}\mspace{14mu},X_{n}} \right)} = {{\prod\limits_{i = 1}^{N}\; {P\left( X_{i} \middle| {{Pa}\left( X_{i} \right)} \right)}} = {{P\left( X_{1} \right)}{\prod\limits_{i = 2}^{N}\; {{P\left( {\left. X_{i} \middle| X_{1} \right.,\ldots \mspace{14mu},X_{i - 1}} \right)}.}}}}} & (1)\end{matrix}$

Where P(X_(i)|Pa(X_(i))) is the conditional distribution of Xi given itsparents, Pa(X_(i)). The second equality holds as long as X₁; X₂; : : : ;X_(N) are in topological order.

Now, a set of observed nodes is considered,

⊂X and their observed values {circumflex over (x)}. To conduct Bayesianinference when provided with a set of unobserved variables, say

⊂X\

the posterior marginal is computed:

$\begin{matrix}{{P\left( {| = \hat{x}} \right)} = {\frac{P\left( {| = \hat{x}} \right)}{P\left( {= \hat{x}} \right)} = \frac{{P{()}}{P\left( {= \left. \hat{x} \right|} \right)}}{P\left( {= \hat{x}} \right)}}} & (2)\end{matrix}$

In the optimal scenario, Equation (2) could be computed exactly.However, as noted above exact inference becomes intractable in large BNsas computational costs grow exponentially with effective clique size,—in the worst case, becoming an NP-hard problem

In an embodiment, importance sampling is used. Here, a function ƒ isconsidered for which it's expectation, Ep[ƒ] is to be estimated, undersome probability distribution P. It is often the case that we canevaluate P up to a normalizing constant, but sampling from it is costly.

In Importance Sampling, expectation Ep[ƒ] is estimated by introducing adistribution Q, known as the proposal distribution, which can both besampled and evaluated. This gives:

$\begin{matrix}\begin{matrix}{{E_{p}\lbrack f\rbrack} = {\int{{f(x)}{P(x)}{dx}}}} \\{= {\int{{f(x)}\frac{P(x)}{Q(x)}{Q(x)}{dx}}}} \\{{= {\lim\limits_{n\rightarrow\infty}{\frac{1}{n}{\sum\limits_{i = 1}^{n}{{f\left( x_{i} \right)}w_{i}}}}}},}\end{matrix} & (3)\end{matrix}$

Where x_(i)˜Q and where w_(i)=P(x_(i))/Q(x_(i)) are the importancesampling weights. If P can only be evaluated up to a constant, theweights need to be normalized by their sum.

In the case of Inference on a BN, the strategy is to estimate P(

|

) with an importance sampling estimator if there is appropriate Q tosample from. In the case of using Importance Sampling to sample from theposterior, the weight also contains the likelihood P(

={circumflex over (x)}|

). Also, while the equalities in (3) hold for any appropriate Q, this istrue only in the limit as n→∞ and it is not the case that all importancesampling estimators have the same performance in terms of variance, orequivalently, time till convergence.

For example in likelihood weighting, if Q=P(

) is very far from P(

|

) then, with only a small number of weights dominating the estimate, thevariance of the estimator can be potentially huge and the estimationwill be poor unless too many samples are generated.

For this reason, the joint distribution Q=P(

|

) would be the optimal proposal and thus, it would be helpful to obtainan estimate of this to reduce the variance in importance sampling.

In an embodiment a discriminative model UM(:) (a feedforward neuralnetwork or a neural network, with architecture related to auto encoderbut with multiple branches) is trained to approximate any possibleposterior marginal distribution for any binary BN.

$\begin{matrix}{Y = {{{UM}{()}} \approx \begin{bmatrix}{P\left( \left. X_{1} \right| \right)} \\{P\left( \left. X_{2} \right| \right)} \\\vdots \\{P\left( \left. X_{n} \right| \right)}\end{bmatrix}}} & (4)\end{matrix}$

where n is the total number of nodes, and

are the observations. Y is a vector of conditional marginalprobabilities for every node in the BN, whether observed or not (if nodeX_i is observed, the marginal posterior distribution for it will betrial, i.e. P(X_(i)|

)=1 or P(X_(i)|

)=0).

To approximate any possible posterior marginal distribution i.e., givenany possible set of evidence

, only one model is needed. Due to this the discriminative model isdescribed as a Universal Marginalizer (UM). The existence of such anetwork is a direct consequence of the universal function approximationtheorem (UFAT). This is illustrated by considering marginalization in aBN as a function and that, by UFAT, any measurable function can bearbitrarily approximated by a neural network. Therefore, such a UM canbe used as a proposal for the distribution.

The training process for the above described UM involves generatingsamples from the underlying BN, in each sample masking some of thenodes, and then training with the aim to learn a distribution over thisdata. This process is explained through the rest of the section andillustrated in FIG. 3.

Such a model can be trained off-line by generating samples from theoriginal BN (PGM 120 of FIG. 1) via ancestral sampling in step S201. Inan embodiment unbiased samples are generated from the probabilisticgraphical model (PGM) using ancestral sampling. Each sample is a binaryvector which will be the values for the classifier to learn to predict.

In an embodiment, for the purpose of prediction then some nodes in thesample then be hidden, or “masked” in step S203. This masking is eitherdeterministic (in the sense of always masking certain nodes) orprobabilistic over nodes. In embodiment each node is probabilisticallymasked (in an unbiased way), for each sample, by choosing a maskingprobability p˜U[0,1] and then masking all data in that sample withprobability p. However, in a further embodiment, the masking process isas described in Further Embodiment A.

The nodes which are masked (or unobserved when it comes to inferencetime) are represented consistently in the input tensor in step S205.Different representations of obscured nodes will be described later, fornow, they will be represented them as a ‘*’.

The neural network is then trained using a cross entropy loss functionin step S207 in a multi-label classification setting to predict thestate of all observed and unobserved nodes. Any reasonable, i.e., atwice-differentiable norm, loss function could be used. In a furtherembodiment, the output of the neural net can be mapped to posteriorprobability estimates. However, when the cross entropy loss is used, theoutput from the neural net is exactly the predicted probabilitydistribution. In a further embodiment, the loss function is split fordifferent sub-sets of nodes for more efficient learning as described inFurther Embodiment A.

The trained neural network can then be used to obtain the desiredprobability estimates by directly taking the output of the sigmoidlayer. This result could be used as a posterior estimate. However, in afurther embodiment as described below the UM is combined with importancesampling to improve the accuracy of UM and the speed of importancesampling.

Thus a discriminative model is now produced which, given any set ofobservations

, will approximate all the posterior marginals in step S209. Note thatthe training of a discriminative model can be performed, as oftenpractised, in batches; for each batch, new samples from the model can besampled, masked and fed to the discriminative model training algorithm;all sampling, masking, and training can be performed on GraphicsProcessing Units, again as often practised.

This trained Neural net becomes the UM 113 of FIG. 1 and is used toproduce the predictions sent to the inference engine 109 in step S115.

In the embodiment, described with reference to FIG. 1, ImportanceSampling in the inference engine is augmented by using the predictedposteriors from the UM as the proposals. Using the UM+IS hybrid it ispossible to improve the accuracy of results for a given number ofsamples and ultimately speed up inference, while still maintaining theunbiased guarantees of Importance Sampling, in the limit.

In the above discussion of Importance sampling, we saw that the optimalproposal distribution Q for the whole network is the posterior itself P(

|

), and thus for each node the optimal proposal distribution isQ_(opt)=P(X_(i)∈

|

∪X_(S)), where

are the evidence nodes and X_(S) the already sampled nodes beforesampling X_(i).

As it is now possible, using the above UM to approximate, for all nodes,and for all evidences, the conditional marginal, the sampled nodes canbe incorporated into the evidence to get an approximation for theposterior and use it is as proposal. For node i specifically, thisoptimal Q* is:

Q* _(i) =P(X _(i) |{X ₁ , . . . ,X _(i−1) }∪

={circumflex over (x)})≈UM({X ₁ , . . . ,X _(i−1)}∪

)_(i)=Q_(i)  (5)

The process for sampling from these approximately optimal proposals isillustrated in the algorithm 1 below and in FIG. 4 where the part withinthe box is repeated for each node in the BN in topological order.

In step S301, the input is received and passed to the UM (NN). The NNinput is then provided to the NN (which is the UM) in step S303. The UMcalculates in step S305, the output q that it provides in step S307.This is the provided to the Inference engine in step S309 to sample nodeX_(i) from the PGM. Then, that node value is injected as an observationinto {circumflex over (x)}, and it is repeated for the next node (hence‘i=i+1’). In step S311, we receive a sample from the approximate joint.

Alogorithm 1 Sequential Universal Marginalizer importance sampling 1:Order the nodes topologically X₁, . . . X_(N), where N is the totalnumber of nodes. 2: for j in [1, . . . , M] (where M is the total numberof samples): do 3:  {tilde over (x)}_(S) = 0 4:  for i in [1, . . . ,N]: do 5:   sample node x_(i) from Q(X_(i)) = UM(

)_(i) ≈ P(X_(i)|X_(S),

) 6:   add x_(i) to {tilde over (x)}_(S) 7:  [x_(S)]_(j) = {tilde over(x)}_(S) 8:  $w_{j} = {\prod\limits_{i = 1}^{N}\; {\frac{P_{i}}{Q_{i}}\mspace{14mu} \left( {{{where}\mspace{14mu} P_{i}\mspace{14mu} {is}\mspace{14mu} {the}\mspace{14mu} {likelihood}},\; {P_{i} = {P\left( {X_{i} = {{x_{i}\left. x_{S\bigcap{{Pa}{(X_{i})}}} \right)\mspace{14mu} {and}\mspace{14mu} Q_{i}} =}} \right.}}} \right.}}$Q(X_(i) = x_(i))) 9:${E_{p}\lbrack X\rbrack} = {\frac{\sum\limits_{j = 1}^{M}{X_{j}w_{j}}}{\sum\limits_{j = 1}^{M}w_{j}}\mspace{14mu} \left( {{as}\mspace{14mu} {in}\mspace{14mu} {standard}\mspace{14mu} {IS}} \right)}$

That is, following the requirement that parents are sampled before theirchildren and adding any previously sampled nodes into the evidence forthe next one, we are ultimately sampling from the approximation of thejoint distribution. This can be seen by observing the product of theprobabilities we are sampling from.

It can be seen that the proposal Q, constructed in such a way, becomesthe posterior itself:

$\begin{matrix}\begin{matrix}{Q = {\prod\limits_{i = 1}^{N}\; Q_{i}}} \\{= {{{UM}{()}}_{i}{\prod\limits_{i = 2}^{N}\; {{{UM}\left( {X_{1},\ldots \mspace{14mu},X_{i - 1},} \right)}_{i}(7)}}}} \\{\approx {{P\left( \left. X_{1} \right| \right)}{\prod\limits_{i = 2}^{N}\; {{P\left( {\left. X_{i} \middle| X_{1} \right.,\ldots \mspace{14mu},X_{i - 1},} \right)}(8)}}}} \\{= {{P\left( {X_{1},X_{2},\ldots \mspace{14mu},\left. X_{n} \right|} \right)}(9)}}\end{matrix} & (6)\end{matrix}$

This procedure requires that nodes are sampled sequentially, using theUM to provide a conditional probability estimate at each step. This canaffect computation time, depending on the parallelization scheme usedfor sampling. However, parallelization efficiency can be recovered byincreasing the number of samples, or batch size, for all steps.

In Importance Sampling, each node will be conditioned on nodestopologically before it. The training process may therefore be optimizedby using a “sequential masking” process in the training process as inFIG. 3, where firstly we randomly select up to which node X_(i) we willnot mask anything, and then, as previously, mask some nodes startingfrom node X_(i+1) (where nodes to be masked are selected randomly, asexplained before). This is to perform to a more optimal way of gettingtraining data.

Another way of doing an embodiment might involve a hybrid approach asshown in Algorithm 2 below. There, an embodiment might includecalculating the conditional marginal probabilities only once, given theevidence, and then constructing a proposal for each node X_(i) as amixture of those conditional marginals (with weight β) and theconditional prior distribution of a node (with weight (1−β)).

Alogorithm 2 Hybrid UM-IS 1: Order the nodes topologically X₁, . . .X_(N), where N is the total number of nodes. 2: for j in [1, . . . , M](where M is the total number of samples): do 3:  {tilde over (x)}_(S) =0 4:  for i in [1, . . . , N]: do 5:   sample node x_(i) from Q(X_(i)) =βUM_(i)(

)_(i) + (1 − β)P(X_(i) = x_(i)|x_(S∩Pa(x) _(i) ₎) 6:   add x_(i) to{tilde over (x)}_(S) 7:  [x_(S)]_(j) = {tilde over (x)}_(S) 8:  $w_{j} = {\prod\limits_{i = 1}^{N}\; {\frac{P_{i}}{Q_{i}}\mspace{14mu} \left( {{{where}\mspace{14mu} P_{i}\mspace{14mu} {is}\mspace{14mu} {the}\mspace{14mu} {likelihood}},\; {P_{i} = {P\left( {X_{i} = {{x_{i}\left. x_{S\bigcap{{Pa}{(X_{i})}}} \right)\mspace{14mu} {and}\mspace{14mu} Q_{i}} =}}\; \right.}}} \right.}}$Q(X_(i) = x_(i))) 9:${E_{p}\lbrack X\rbrack} = {\frac{\sum\limits_{j = 1}^{M}{X_{j}w_{j}}}{\sum\limits_{j = 1}^{M}w_{j}}\mspace{14mu} \left( {{as}\mspace{14mu} {in}\mspace{14mu} {standard}\mspace{14mu} {IS}} \right)}$

While this hybrid approach might be easier and potentially lesscomputationally expensive, in cases when P(X_(i)|X_(S)∪

) is far from P(X_(i)|

), this will be just a first-order approximation, hence the variancewill be higher and we generally need more samples to get a reliableestimate.

The intuition for approximating P(X_(i)|

∪X_(S)) by linearly combining P(X_(i)|Pa(X_(i))) and UM(

)_(i) is simply that UM(

)_(i) will take into account the effect of the evidence on node i andP(X_(i)|Pa(X_(i))) will take into account the effect of X_(S), namelythe parents. Note that β could also be allowed to be a function of thecurrently sampled state and the evidence, for example if all theevidence is contained in parents then=0 is optimal.

FIG. 5 shows a layout of a system in accordance with a furtherembodiment of the invention.

The system 401 comprises a processor 403, the processor comprisesComputer Processing units (CPUs) 405 and Graphical Processing units(GPUs) 407 that operates under the control of the host. GPUs 407 offer asimplified instruction set that is well suited for a number of numericalapplications. Due to the simplified instruction set they are notsuitable for any general purpose computing in the same way that CPUs areused, however thanks to these simplifications GPU 407 can offer a muchlarge number of processing cores. This makes GPU 407 ideally suited forapplications where computations can be parallelised.

To achieve these performance gains algorithms usually need to bemodified to express this parallelism in a form that is easy to implementon GPUs. This can be done either via low level custom GPU instructions(e.g., implementing algorithm in terms of low level CUDA code) or,alternatively, algorithms can be expressed more generally in terms ofcommon vectorised operations, such as scatter, gather, reduce ontensors, as well as higher level numerical routines such as matrixtranspose, multiplication, etc.

To express vectorised operations and to make use of higher level tensorframeworks with GPU support, it is possible to use products such asTensorFlow, PyTorch, etc. Once calculations are expressed in vectorisedform, in an embodiment, it is possible to make use of the large numberof processing cores in modern GPUs by generating a large batch of randomnumbers, for the importance sampling procedure. The GPU uses dataacquired from the PGM 409.

The above approach using a GPU is used to determine the posteriormarginal probabilities P(D_(i)|evidence), P(S_(i)|evidence) andP(RF_(i)|evidence).

In an embodiment, the Noisy-Or model for the conditional priorprobabilities in the PGM is used (see for example Koller & Friedman2009, Probabilistic Graphical Models: Principles and Techniques The MITPress). In an embodiment, the procedure is modified to improve thenumerical stability and to parallelise the computation of theconditional priors.

To improve numerical accuracy, in an embodiment, most calculations areperformed in the log domain. From basic properties of the log functionmultiplication becomes addition. So in the example of the Noisy-Or modelinstead of calculating probabilities as a multiplications of lambdas,the sum of log(λ) is computed:

${{P\left( {X_{i} = {\left. F \middle| {{Pa}\left( X_{i} \right)} \right. = \left\lbrack {x_{k},\ldots \mspace{14mu},x_{l}} \right\rbrack}} \right)} = {{\lambda_{0}{\prod\limits_{j = k}^{l}\; {x_{j}\lambda_{j}}}} = {\exp\left( {{\log \left( \lambda_{0} \right)} + {\sum\limits_{j = k}^{l}{\log \left( {x_{j}\lambda_{j}} \right)}}} \right)}}},$

where x_(j)∈{0,1}.

To further improve performance the above is then expressed as a tensoroperation. Here, a lambda matrix Λ, is constructed where Λjk is equal tothe log lambda value of node j with node k, with Λjk=0 if node k is nota parent of node j. P(X_(i)|Pa(X))=ƒ([x_(k) . . . x_(l)]) can be thenexpressed as a ΣΛ_(j:)*S, where S is the samples tensor and * denoteselement-wise multiplication.

To illustrate this, firstly, the likelihood weighting method is shown.The following procedure generates one sample using likelihood weighting.A sample is a full instantiation of the network, that is all nodes inthe network will be assigned a state. Nodes that are in the evidence setE will be set to their observed state, whereas nodes not in the evidencewill be randomly sampled according to their conditional probabilitygiven their parents' state.

Procedure Generate-LW-Sample ( B, // Bayesian network over X E = e, //Evidence ) // conditional probabilities below are calculated usingNoisy-Or model Let X_1, ..., X_n be a topological ordering of X w = 1for i = 1, ..., n u_i = x<Pa(X_i)> // get the sampled state of theparents of X_i if X_i not in E then // if X_i not in evidence thensample x_i = sample from P(X_i | u_i) else x_i = e<X_i> // evidencestate of X_i in z w = w * P(X_i | u_i) // multiply weight by probabilityof the evidence state given the node's parents return (x_1, ..., x_n), w

It is then possible to estimate a probability query y by generating Msamples by repeating calling the procedure above M times and thencalculate the estimate as:

${P\left( y \middle| E \right)} = \frac{\sum\limits_{m = 1}^{M}{{w\lbrack m\rbrack}*I\left\{ {{y\lbrack m\rbrack} = y} \right\}}}{\sum\limits_{m = 1}^{M}{w\lbrack m\rbrack}}$

Where I is an indicator function which is equal to 1 if the sampledstate y of sample m is the same as the target y. In binary nodes, itjust means that all weights are summed where y is true and divide thatby the total sum of weights.

This procedure can then be extended for importance sampling as follows:

Procedure Generate-IS-Sample ( 

B, // Bayesian network over X 

E = e, // Evidence Q // proposal probability distribution ) //conditional probabilities below are calculated using Noisy-Or model LetX_1, ..., X_n be a topological ordering of X w = 1 for i = 1, ..., n u_i= x<Pa(X_i)> // get the sampled state of the parents of X_i  if X_i notin E then // if X_i not in evidence then sample p = P(X_i | u_i) q =Q(X_i | u_i, E) x_i = sample from q w = w * (p / q) else x_i = e<X_i> //evidence state of X_i in z w = w * P(X_i | u_i) // multiply weight byprobability of the evidence state given the node's parents return (x_1,..., x_n), w

The main difference here is that instead of sampling directly from P itis now possible to sample from Q and correct the weight by the ratiop/q. This can be parallelized on a GPU by simply generating a batch ofmultiple samples at a time:

Procedure IS-Generate-Batch-of-Samples ( B, // Bayesian network over X E= e, // Evidence Q, // proposal probability distribution K // batchsize 

) 

 // conditional probabilities below are calculated using Noisy-Or modelLet X_1, ..., X_n be a topological ordering of X w = [1, ..., 1] //vector of weights of size Kx1 for i = 1, ..., n u_i = x<Pa(X_i)> // getthe sampled state of the parents of X_i in each sample in the batch,dimension is Kx1 

if X_i not in E then // if X_i not in evidence then sample p = P(X_i |u_i) q = Q(X_i | u_i, E)

 x_i = sample K times from q // dimension Kx1 

w = w * (p / q) // multiply each element of w by p/q else 

x_i = e<X_i> // evidence state of X_i in z, Kx1 dimension w = w * P(X_i| u_i) // multiply weight by probability of the evidence state given thenode's parents, Kx1 dimension return [(x_1, ..., x_n)_1, ..., (x_1, ...,x_n)_K] , w // returns K samples and K weights

It is possible to make significant improvements to the efficiency of thetensor representation by optimising the number and size of the tensorsused to represent the network—essentially representing the network by anumber of ‘layers’ upon which independent sampling/calculations can beperformed. Working with smaller tensors increases the speed ofcomputation for the numerous tensor manipulations used throughout theinference process, but at the expense of increasing the number of thosemanipulations required due to the increased number of tensorsrepresenting the same network.

To optimise the decomposition of the network into layers, thetopologically sorted list of network nodes is split into multiplepotential ‘layers’ according via a grid search over three parametersbased on the size of tensors created by each layer, namely:

-   -   the minimum tensor size    -   the maximum tensor size    -   the total ‘waste’ incurred due to the sequentially increasing        tensor size    -   the resultant layers at each grid point are tested according to        the metric:        -   M=(10*number_of_layers)+total_waste

Where the total_waste is calculated as the penalty incurred by theincremental increase in individual layer's tensor size. The group oflayers with the lowest M are chosen as the optimum representation.

To improve sampling efficiency, in an embodiment, the current estimateof the posterior being calculated was mixed into the Importance Samplingproposal distribution Q. It was found that this helps with convergence.To calculate the current estimate the same probability query formula asabove was used with the samples generated so far.

q′(X_i|u_i,E)=(1−α)*q(X_i|u_i,Evidence)+α*current_estimate ofP(X_i|Evidence)

In a further embodiment, proposal probabilities q were kept within a maxprecision range to sampling efficiency of the importance sampler in somecases by requiring less samples to arrive at a target accuracy.

Clipping is performed as follows:

clipped_q=min(max(q,ε),1−ε)

Here ε was set to 0.001.

In a further embodiment, for importance sampling q was imposed (i.e. byredefining the proposal q:=clipping(q)) to be not too different from pin order to reduce the variance of the weights. One simple heuristicthat was found useful was to ensure that q is such that

$\frac{\gamma \; p}{p + \gamma - 1} < \frac{p}{q} < \gamma$

holds, with a typical value for γ of 2.

In a yet further embodiment, for some nodes an extension to the noisy-ORmodel was employed which is particularly useful for medical diagnosis,namely the noisy-MAX model. For a binary disease node, the noisy-ORmodel allows for a child node representing a symptom to be binary (e.g.absent|present). The noisy-MAX model however allows nodes to have one ofa variety of states. Therefore for a symptom node it becomes possible toencode the severity of the symptom, for example, by any number ofparticular states (e.g. absent|mild|strong|severe). Whereas in thenoisy-OR model each node-parent connection is described by a singleprobability parameter (lambda), the noisy-MAX algorithm requiresmultiple parameters describing the variety of multiple states in whichthe node can exist.

Noisy-MAX nodes are therefore implemented as well on GPUs in ourembodiment by adding an additional dimension to the probability valuelambda matrix, and producing categorical samples according to the valuesin this dimension (i.e. sampling from a number of possible states, asopposed to simply true/false).

A demonstration of the above is presented in Further Embodiment A.

In addition to demonstrate the above, the following experiments wereperformed:

The UM network was trained using cross-entropy loss. Specifically, ReLUnon-linearity was used with an applied dropout of 0.5 before each hiddenlayer and the Adam optimizer was used.

As noted above, there are many options for ways to represent theunobserved nodes on the input layer of the neural network when trainingthe UM as explained with reference to FIG. 3.

Three representations were trialled:

1. 32-bit Continuous Representation Represent false as 0, true as 1 andthe unobserved values by a point somewhere between 0 and 1. This isanalogous to the probability of the input being true. Three values wereused for unobserved: 0, 0.5 and the prior of the node.

2. 2-bit Representation Here, 1 bit was used to represent whether thenode is observed and another node to represent whether it is true.{(Observed, True), (Observed, False), (Unobserved,False)}={(1,1),(1,0),(0,0)} which is equivalent to {True, False,Unobserved}={1,2,3} in terms of information.

3. 33-bit Representation (1 bit+Continuous) The further option is tocombine these two approaches to allow one bit to represent whether thenode is observed or not and another variable to be a float between 0 and1 to represent the probability of this being true.

To measure the quality of the conditional marginals in themselves, atest set of some evidences. For each evidence, “ground truth” posteriormarginals were calculated via Likelihood Sampling with 300 millionsamples. Then two main metrics were used to assess the quality of thedistributions. Firstly the mean absolute error—which calculates theabsolute error between the true node posterior and predicted nodeposterior, then averaged over the test set of evidences.

The second metric is the max error—this looks for the maximumprobability error across all nodes in the predictions and then averagesthese over data points. A grid search was run on network size andunobserved representation and the results are reported using the twometrics in table 1.

TABLE 1 avg error/max error for 20,000 iterations 2 bit 33 bit (priors)[2048] 0.0063/0.3425 0.0060/0.3223 [4096] 0.0053/0.2982 0.0052/0.2951[1024, 1024] 0.0081/0.4492 0.0083/0.4726 [2048, 2048] 0.0071/0.40760.0071/0.4264

It can be seen that the largest one layer network performs the best. Thedifference between the two representations is not large, but the resultssuggest that providing the priors may help improve performance.

Next, experiments were performed to assess the use of UM posteriorestimates as proposals.

To do this comparison, the error to the test set over time was evaluatedas the number of samples increases. This was done for a standardlikelihood weighting and also Importance Sampling with the UM marginalsas a proposal distribution. Again both average absolute and max errorsover all the case cards were measured.

Firstly the approximate joint distribution as described above was usedwith empirically a very good beta of 0.1. With beta of 0.1 equivalentresults in 250k samples were seen as in likelihood weighting 750ksamples, so already this is a 3× speed up.

Although the above has been described in relation to medical data, theabove system could also be used for any determination process wherethere are a plurality of interlinked factors which are evidenced byobservations and a determination of a probable cause is required. Forexample, the above method can be used in financial systems.

Also, although the above has used the output of the discriminative modelas an aid in conducting approximate inference, in some cases, theestimate produced by the discriminative model may be used on its own.Also, the embeddings of such discriminative models (e.g. neuralnetworks) can serve as a vectorised representation of the providedevidence for tasks like classification and clustering or interpretationof node relationships, as described in further embodiment A.

Further Embodiment A

An framework for the embodiments described herein is shown in FIG. 6that shows the Universal Marginaliser (UM): The UM performs scalable andefficient inference on graphical models. This Figure shows one passthrough the network. First, (1) a sample is drawn from the PGM, (2)values are then masked and (3) the masked set is passed through the UM,which then, (4) computes the marginal posteriors.

As described above, probabilistic graphical models are powerful toolsthat allow formalisation of knowledge about the world and reason aboutits inherent uncertainty. There exist a considerable number of methodsfor performing inference in probabilistic graphical models; however,they can be computationally costly due to significant time burden and/orstorage requirements; or they lack theoretical guarantees of convergenceand accuracy when applied to large scale graphical models. To this end,the above described Universal Marginaliser Importance Sampler (UM-IS) isimplemented—a hybrid inference scheme that combines the flexibility of adeep neural network trained on samples from the model and inherits theasymptotic guarantees of importance sampling. The embodiment describedherein shows how combining samples drawn from the graphical model withan appropriate masking function allows training of a single neuralnetwork to approximate any of the corresponding conditional marginaldistributions, and thus amortise the cost of inference. It is also shownthat the graph embeddings can be applied for tasks such as: clustering,classification and interpretation of relationships between the nodes.Finally, the method is benchmarked on a large graph (>1000 nodes),showing that UM-IS outperforms sampling-based methods by a large marginwhile being computationally efficient.

In this embodiment, the Universal Marginaliser Importance Sampler(UM-IS), an amortised inference-based method for graph representationand efficient computation of asymptotically exact marginals is used. Inorder to compute the marginals, the UM still relies on ImportanceSampling (IS). A guiding frame-work based on amortised inference is usedthat significantly improves the performance of the sampling algorithmrather than computing marginals from scratch every time the inferencealgorithm is run. This speed-up allows the application of the inferencescheme on large PGMs for interactive applications with minimum errors.Furthermore, the neural network can be used to calculate a vectorisedrepresentation of the evidence nodes. This representation can then beused for various machine learning tasks such as node clustering andclassification.

The main contributions of this embodiment are as follows:

-   -   UM-IS is used as a novel algorithm for amortised inference-based        importance sampling. The model has the flexibility of a deep        neural network to perform amortised inference. The neural        net-work is trained purely on samples from the model prior and        it benefits from the asymptotic guarantees of importance        sampling.    -   The efficiency of importance sampling is significantly improved,        which makes the proposed method applicable for interactive        applications that rely on large PGMs.    -   It will be shown on a variety of toy network and on a medical        knowledge graph (>1000 nodes) that the proposed UM-IS        outperforms sampling-based and deep learning-based methods by a        large margin, while being computational efficient.    -   It will be shown that the networks embeddings can serve as a        vectorised representation of the provided evidence for tasks        like classification and clustering or interpretation of node        relationships.

As described above, the Universal Marginaliser (UM) is a feed-forwardneural network, used to perform fast, single-pass approximate inferenceon general PGMs at any scale. The UM can be used together withimportance sampling as the proposal distribution, to obtainasymptotically exact results when estimating marginals of interest. Thishybrid model will be referred to as the Universal MarginaliserImportance Sampler (UM-IS).

As described above, a Bayesian Network (BN) encodes a distribution Pover the random variables X={X₁, . . . , X_(N)} through a DirectedAcyclic Graph (DAG), the random variables are the graph nodes and theedges dictate the conditional independence relationships between randomvariables. Specifically, the conditional independence of a randomvariable X_(i) given its parents pa(X_(i)) is denoted asP(X_(i)|pa(X_(i))).

The random variables can be divided into two disjoint sets,

⊂X the set of observed variables within the BN, and

⊂X\

the set of the unobserved variables.

In this embodiment, a Neural Network (NN) is implemented as anapproximation to the marginal posterior distributions P(X_(i)|

=

) for each variable X_(i)∈X given an instantiation

of any set of observations.

is defined as the encoding of the instantiation that specifies whichvariables are observed, and what their values are. For a set of binaryvariables X_(i) with i∈0, . . . , N, the desired network maps theN-dimensional binary vector

⊂B^(N) to a vector in [0,1]^(N) representing the probabilitiesp_(i):=P(1|

=

)

Y=UM(

)≈(p ₁ , . . . ,p _(N)).  (A.1)

This NN is used as a function approximator, hence, it can approximateany posterior marginal distribution given an arbitrary set of evidence

. For this reason, this discriminative model is termed the UniversalMarginaliser (UM). If the marginalisation operation in a BayesianNetwork is considered as a function ƒ: B^(N)→[0,1]^(N), then existenceof a neural network which can approximate this function is a directconsequence of the Universal Function Approximation Theorem (UFAT). Itstates that, under mild assumptions of smoothness, any continuousfunction can be approximated to an arbitrary precision by a neuralnetwork of a finite, but sufficiently large, number of hidden units.Once the weights of the NN are optimised, the activations of thosehidden units can be computed to any new set of evidence. They are acompressed vectorised representation of the evidence set and can be usedfor tasks such as node clustering or classification.

Next, each step of the UM's training algorithm for a given PGM will bedescribed. This model is typically a multi-output NN with one output pernode in the PGM (i.e. each variable X_(i)). Once trained, this model canhandle any type of input evidence instantiation and produce approximateposterior marginals P(X_(i)=1|

=

).

The flow chart with each step of the training algorithm is depicted inFIG. 7. For simplicity, it will be assumed that the training data(samples for the PGM) is pre-computed, and only one epoch is used totrain the UM.

In practice, the following steps 1-4 are applied for each of themini-batches separately rather than on a full training set all at once.This improves memory efficiency during training and ensures that thenetwork receives a large variety of evidence combinations, ac-countingfor low probability regions in P. The steps are given as follows:

1. S501 Acquiring samples from the PGM. The UM is trained offline bygenerating unbiased samples (i.e., complete assignment) from the PGMusing ancestral sampling. The PGM described here only contains binaryvariables X_(i), and each sample S_(i)∈B^(N) is a binary vector. In thenext steps, these vectors will be partially masked as input and the UMwill be trained to reconstruct the complete unmasked vectors as output.

2. S503 Masking. In order for the network to approximate the marginalposteriors at test time, and be able to do so for any input evidence,each sample S_(i) is partially masked. The network will then receive asinput a binary vector where a subset of the nodes initially observedwere hidden, or masked. This masking can be deterministic, i.e., alwaysmasking specific nodes, or probabilistic. Here a different maskingdistribution is used for every iteration during the optimizationprocess. This is achieved in two steps. First, two random numbers from auniform distribution i,j˜U[0,N] are sampled where N is the number ofnodes in the graph. Next, making is performed from randomly selectedi(j) number of nodes the positive (negative) state. In this way, theratio between the positive and negative evidence and the total number ofmasked nodes is different with every iteration. A network with a largeenough capacity will eventually learn to capture all these possiblerepresentations.

There is some analogy here to dropout in the input layer and so thisapproach could work well as a regulariser, independently of thisproblem. However, it is not suitable for this problem because of theconstant dropout probability for all nodes.

3. S505 Encoding the masked elements. Masked elements in the inputvectors S^(masked) artificially reproduce queries with unobservedvariables, and so their encoding must be consistent with the one used attest time. The encodings are detailed below.

4. S507 Training with Cross Entropy Loss. The NN was trained byminimising the multi-label binary cross entropy of the sigmoid outputlayer and the unmasked samples S_(i).

5. S509 Outputs: Posterior marginals. The desired posterior marginalsare approximated by the output of the last NN-layer. These values can beused as a first estimate of the marginal posteriors (UM approach);however, combined with importance sampling, these approximated valuescan be further refined (UM-IS approach).

The UM is a discriminative model which, given a set of observations

, will approximate all the posterior marginals. While useful on its own,the estimated marginals are not guaranteed to be unbiased. To obtain aguarantee of asymptotic unbiasedness while making use of the speed ofthe approximate solution, the estimated marginals are used for proposalsin importance sampling. A naïve approach is to sample each X_(i)∈

independently from UM(

)_(i), where UM(

)_(i) is the i-th element of vector UM(

). However, the product of the (approximate) posterior marginals may bevery different to the true posterior joint, even if the marginalapproximations are good.

The universality of the UM makes the following scheme possible, whichwill be termed the Sequential Universal Marginaliser Importance Sampling(SUM-IS). A single proposal is sampled

sequentially as follows. First, a new partially observed state isintroduced

and it is initialised to

. Then, [

]₁˜UM(

)₁ is sampled and the previous sample

is updated such that X₁ is now observed with this value. This process isrepeated, at each step sampling [

]₁˜UM(

)₁, and

is updated to include the new sampled value. Thus, the conditionalmarginal can be approximated for a node i given the current sampledstate

and evidence

to get the optimal proposal Q⁵⁶⁰ _(i) as follows:

Q ⁵⁶⁰ _(i) *=P(X _(i) |{X ₁ , . . . ,X _(i−1)}∪

)≈UM(

)_(i).  (A.2)

Thus, the full sample

is drawn from an implicit encoding of the approximate posterior jointdistribution given by the UM. This is because the product of sampledprobabilities from Equation A.3 is expected to yield low varianceimportance weights when used as a proposal distribution.

$\begin{matrix}\begin{matrix}{Q = {{{UM}\left( {\overset{\sim}{x}}_{} \right)}_{1}{\prod\limits_{i = 2}^{N}\; {{UM}{()}}_{i}}}} \\{\approx {{P\left( \left. X_{1} \right| \right)}{\prod\limits_{i = 2}^{N}\; {{P\left( {\left. X_{i} \middle| X_{1} \right.,\ldots \mspace{14mu},X_{i - 1},} \right)}.\left( {A{.4}} \right)}}}}\end{matrix} & \left( {A{.3}} \right)\end{matrix}$

The process for sampling from these proposals illustrated in Algorithm1A and in FIG. 8. The nodes are sampled sequentially using the UM toprovide a conditional probability estimate at each step. Thisrequirement can affect computation time, depending on theparallelisation scheme used for sampling. In our experiments, weobserved that some parallelisation efficiency can be recovered byincreasing the number of samples per batch.

Alogorithm 1A Sequential Universal Marginalizer importance sampling 1:Order the nodes topologically X₁, . . . X_(N), where N is the totalnumber of nodes. 2: for j in [1, . . . , M] (where M is the total numberof samples): do 3:  {tilde over (x)}_(S) = 0 4:  for i in [1, . . . ,N]: do 5:   sample node x_(i) from Q(X_(i)) = UM(

)_(i) ≈ P(X_(i)|X_(S),

) 6:   add x_(i) to {tilde over (x)}_(S) 7:  [x_(S)]_(j) = {tilde over(x)}_(S) 8:  $w_{j} = {\prod\limits_{i = 1}^{N}\; {\frac{P_{i}}{Q_{i}}\mspace{14mu} \left( {{{where}\mspace{14mu} P_{i}\mspace{14mu} {is}\mspace{14mu} {the}\mspace{14mu} {likelihood}},\; {P_{i} = {P\left( {X_{i} = {{x_{i}\left. x_{S\bigcap{{pa}{(X_{i})}}} \right)\mspace{14mu} {and}\mspace{14mu} Q_{i}} =}} \right.}}} \right.}}$Q(X_(i) = x_(i))) 9:${E_{p}\lbrack X\rbrack} = {\frac{\sum\limits_{j = 1}^{M}{X_{j}w_{j}}}{\sum\limits_{j = 1}^{M}w_{j}}\mspace{14mu} \left( {{as}\mspace{14mu} {in}\mspace{14mu} {standard}\mspace{14mu} {IS}} \right)}$

The architecture of the UM is shown in FIG. 9. It has a denoisingauto-encoder structure with multiple branches—one branch for each nodeof the graph. In the experiments, it was noticed that the cross entropyloss for different nodes highly depends on the number of parents and itsdepth in the graph. To simplify the network and reduce the number ofparameters, the weights of all fully connected layers that correspond tospecific type of nodes are shared. The types are defined by the depth inthe graph (type 1 nodes have no parents, type 2 nodes have only type 1nodes as parents etc.). The architecture of the best performing model onthe large medical graph has three types of nodes and the embedding layerhas 2048 hidden states.

In experiments, the best performing UM was chosen in terms of MeanAbsolute Error (MAE) on the test set for the subsequent experiments.ReLU non-linearities, apply dropout was used on the last hidden layerand the Adam optimization method was used with batchsize of 2000 samplesper batch for parameter learning. Batch normalization was also usedbetween the fully connected layers. To train the model on a largemedical graphical model, in total a stream of 3×10¹¹ samples was used,which took approximately 6 days on a single GPU.

Experiments were performed on a large (>1000 nodes) proprietary BayesianNetwork for medical diagnosis representing the relationships betweenrisk factors, diseases and symptoms. An illustration of the modelstructure is given in FIG. 10(c).

Different NN architectures were tried with a grid search over the valuesof the hyperparameters and on the number of hidden layers, number ofstates per hidden layer, learning rate and strength of regularisationthrough dropout.

The quality of approximate conditional marginals was measured using atest set of posterior marginals computed for 200 sets of evidence viaancestral sampling with 300 million samples. The test evidence set forthe medical graph was generated by experts from real data. The testevidence set for the synthetic graphs was sampled from a uniformdistribution. Standard importance sampling was used, which correspondsto the likelihood weighting algorithm for discrete Bayesian networkswith 8 GPUs over the course of 5 days to compute precise marginalposteriors of all test sets.

Two main metrics are considered: the Mean Absolute Error (MAE) given bythe absolute difference of the true and predicted node posteriors andthe Pearson Correlation Coefficient (PCC) of the true and predictedmarginal vectors. Note that we did not observe negative correlations andtherefore both measures are bounded between 0 and 1. The EffectiveSample Size (ESS) statistic was used for the comparison with thestandard importance sampling. This statistics measures the efficiency ofthe different proposal distributions used during sampling. In this case,there was not access to the normalising constant of the posteriordistribution, the ESS is defined as (Σ_(i=1) ^(M)w_(i))²/Σ_(i=1)^(M)(w_(i))², where the weights, w_(i), are defined in Step 8 ofAlgorithm 1A.

A one hot encoding was considered for the unobserved and observed nodes.This representation only requires two binary values per node. One valuerepresents if the node is observed and positive ([0,1]) and the othervalue represents whether this node is observed and negative ([1,0]). Ifthe node is unobserved or masked, then both values are set to zero([0,0]).

In this section, first the results of different architectures for the UMwill be discussed, then the performance of importance sampling will becompared with different proposal functions. Finally, the efficiency ofthe algorithm will be discussed.

A hyperparameter grid search was used on the different networkarchitectures and data representations. The algorithmic performance wasnot greatly affected for different types of data representations. It ishypothesised that this is due to the fact that neural networks areflexible models capable of handling different types of in-putsefficiently by capturing the representations within the hidden layers.In contrast, the network architecture of the UM strongly depends on thestructure of the PGM. For this reason, a specific UM needs to be trainedfor each PGM. This task can be computation-ally expensive but once theUM is trained, it can be used to compute the approximate marginals in asingle forward pass on any new and even unseen set of evidence.

In order to evaluate the performance of sampling algorithms, the changein PCC and MAE on the test sets was monitored with respect to the totalnumber of samples. It was noticed that across all experiments, a fasterincrease in the maximum value or the PCC is observed when the UMpredictions are used as proposals for importance sampling. This effectbecomes more pronounced as the size of the graphical model increases.FIG. 10 indicates standard IS (blue line) reaches PCC close to 1 and anMAE close to 0 on the small network with 96 nodes. In this case of verysmall graphs, both algorithms converge quickly to the exact solution.However, UM-IS (orange-line) still outperforms IS and converges faster,as seen in FIG. 10(a). For the synthetic graph with 798 nodes, standardIS reaches an MAE of 0.012 with 10⁶ samples, whereas the UM-IS error is3 times lower (0.004) for the same number of samples. The sameconclusions can also be drawn for PCC. Most interestingly, on the largemedical PGM (FIG. 10(c)), the UM-IS with 10⁵ samples exhibits betterperformance than standard IS with 10⁵ samples in terms of MAE and PCC.In other words, the time (and computational costs) of the inferencealgorithm is significantly reduced by factor of ten or more. It wasexpected for this improvement to be even stronger on much largergraphical models. The results of a simple UM architecture as a baselineare also included. This simple UM (UM-IS-Basic) has one single hiddenlayer that is shared for all nodes of the PGM. It can be seen that theMAE and PCC still improved over standard IS. However, UM-IS withmultiple fully connected layers per group of nodes significantlyoutperforms the basic UM by a large margin. There are two reasons forthis. First, the model capacity of the UM is higher which allows tolearn more complex structures from the data. Secondly, the losses in theUM are spread across all groups of nodes and the gradient update stepsare optimised with the right order of magnitude for each group. Thisprevents the model from overfitting to the states of a specific type ofnode with a significant higher loss.

Extracting meaningful representations form the evidence set is anadditional interesting feature of the UM. In this section, it isdemonstrated that the qualitative results for this application. Thegraph embeddings are extracted as the 2048 dimensional activations ofthe inner layer of the UM (see FIG. 9). They are a low-dimensionalvectorised representation of the evidence set in which the graphsstructure is preserved. That means that the distance for nodes that aretightly connected in the PGM should be smaller than the distance tonodes than are independent. In order to visualise this feature, thefirst two principal components of the embeddings from different evidencesets are plotted in which it is known that they are related. Theevidence set from the medical PGM is used with different diseases,risk-factors and symptoms as nodes. FIG. 11(a) shows that the embeddingsof sets with active Type-1 and Type-2 diabetes are collocated. Althoughthe two diseases have different underlying cause and connections in thegraphical model (i.e pancreatic beta-cell atrophy and insulin-resistancerespectively), they share similar symptoms and complications (e.gcardiovascular diseases, neuropathy, increased risk of infections etc.).A similar clustering can be seen in FIG. 11(b) for two cardiovascularrisk factors: smoking and obesity, interestingly collocated with a signseen in patient suffering from a severe heart condition (i.e unstableangina, or acute coronary syndrome): chest pain at rest.

To further asses the quality of the UM embeddings, experiments areperformed for node classification with different features and twodifferent classifiers. More precisely, a SVM and Ridge regression modelwith thresholded binary output was trained for multi-task diseasedetection. These models were trained to detect the 14 most frequentdiseases from (a) the set of evidence or (b) the embedding of that set.A 5-fold standard cross validation was used with a grid search over thehyperparameter of both models and the number of PCA components for datapreprocessing. Table 1A shows the experimental results for the two typesof features. As expected, the models that were trained on the UMembeddings reach a significantly higher performance across allevaluation measures. This is mainly because the embeddings of theevidence set are effectively compressed and structured and also preservethe information form the graph structure. Note that the mapping from theevidence set to the embeddings was optimised with a large number ofgenerated samples (3*10¹1) during the UM learning phase. Therefore,these representations can be used to build more robust ma-chine learningmethods for classification and clustering rather then using the rawevidence set to the PGM.

TABLE 1A Classification performances using two different features. Eachclassifier is trained on - dense the dense embedding as features, andinput - the top layer (UM input) as features. The target (output) isalways the disease layer. Linear SVC Ridge dense input dense input F10.67 ± 0.01 0.07 ± 0.00 0.66 ± 0.04 0.17 ± 0.01 Precision 0.84 ± 0.030.20 ± 0.04 0.81 ± 0.06 0.22 ± 0.04 Recall 0.58 ± 0.02 0.05 ± 0.00 0.59± 0.04 0.16 ± 0.01 Accuracy 0.69 ± 0.01 0.31 ± 0.01 0.63 ± 0.02 0.27 ±0.01

The above embodiment paper discusses a Universal Marginaliser based on aneural network which can approximate all conditional marginaldistributions of a PGM. It is shown that a UM can be used via a chaindecomposition of the BN to approximate the joint posterior distribution,and thus the optimal proposal distribution for importance sampling.While this process is computationally intensive, a first-orderapproximation can be used requiring only a single evaluation of a UM perevidence set. The UM on multiple datasets and also on a large medicalPGM is evaluated demonstrating that the UM significantly improves theefficiency of importance sampling. The UM was trained offline using alarge amount of generated training samples and for this reason, themodel learned an effective representation for amortising the cost ofinference. This speed-up makes the UM (in combination with importancesampling) applicable for interactive applications that require a highperformance on very large PGMs. Furthermore, the use of the UMembeddings was explored and it has been shown that they can be used fortasks such as classification, clustering and interpretability of noderelations. These UM embeddings make it possible to build more robustmachine learning applications that rely on large generative models.

Next, for completeness, an overview of importance sampling and how it isused for computing the marginals of a PGM given a set of evidence willbe described.

In BN inference, Importance Sampling (IS) is used to provide theposterior marginal estimates P(

|

). To do so, samples

are drawn from a distribution Q(

|

), known as the proposal distribution. The proposal distribution must bedefined such that both sampling from and evaluation can be performedefficiently.

Provided that P(

,

) can be evaluated, and that this distribution is such that

contains the Markov boundary of

along with all its ancestors, IS states that posterior estimates can beformed as shown in Equation B1 below:

$\begin{matrix}{{{P\left( {X_{} = {\left. x_{} \right| = x_{}}} \right)} = {{\int{1_{x_{}}(x)\frac{P\left( x \middle| x_{} \right)}{Q\left( x \middle| x_{} \right)}{Q\left( x \middle| x_{} \right)}{dx}}} = {{\frac{Q\left( x_{} \right)}{P\left( x_{} \right)}{\int{1_{x_{}}(x)\frac{P\left( {x,x_{}} \right)}{Q\left( {x,x_{}} \right)}{Q\left( x \middle| x_{} \right)}{dx}}}} = {\lim\limits_{n\rightarrow\infty}{\sum\limits_{i = 1}^{n}{1_{x_{}}\left( x_{i} \right)\frac{w_{i}}{\sum_{j = 1}^{n}w_{j}}}}}}}},} & \left( {B\; 1} \right)\end{matrix}$

where x_(i)˜Q and w_(i)=P(x_(i),

)/Q(x_(i),

) are the importance sampling weights and 1_(x) _(U) (x) is an indicatorfunction for

.

The simplest proposal distribution is the prior, P(

) However, as the prior and the posterior may be very different(especially in large networks) this is often an inefficient approach. Analternative is to use an estimate of the posterior distribution as aproposal. In this work, we argue that the UM learns an optimal proposaldistribution.

In an embodiment, for sampling from the posterior marginal a BN can beconsidered with Bernoulli nodes and of arbitrary size and shape.Consider two specific nodes, X_(i) and X_(j), such that X_(j) is causedonly and always by X_(i):

P(X _(j)=1|X _(i)=1)=1,

P(X _(j)=1|X _(i)=0)=0.

Given evidence E, it can be assumed that P(X_(i)|E)=0:001=P(X_(j)|E). Itwill now be illustrated that using the posterior distribution P(X|E) asa proposal will not necessarily yield the best result.

Say we have been given evidence, E, and the true conditional probabilityof P(X_(i)|E)=0:001, therefore also P(X_(j)|E)=0:001. It would benaively expected that P(X|E) to be the optimal proposal distribution.However we can illustrate the problems here by sampling with Q=P(X|E) asthe proposal.

Each node k∈N will have a weight w_(k)=P(X_(k))/Q(X_(k)) and the totalweight of the sample will be

$w = {\prod\limits_{k = 0}^{N}\; {w_{k}.}}$

The weights should be approximately 1 if Q is close to P. However,consider the w_(j). There are four combinations of X_(i) and X_(j).X_(i)=1, X_(j)=1 only will be sampled, in expectation, one every millionsamples, however when the weight is determined w_(j) will bew_(j)=P(X_(j)=1)/Q(X_(j)=1)=1=0:001=1000. This is not a problem in thelimit, however if it happens for example in the first 1000 samples thenit will outweigh all other samples so far. As soon as there is a networkwith many nodes whose conditional probabilities are much greater thantheir marginal proposals this becomes almost inevitable. A furtherconsequence of these high weights is that, since the entire sample isweighted by the same weight, every node probability will be effected bythis high variance.

Further embodiments are set out below:

A method, of using the embeddings of using the above discriminativemodel as a vectorised representation of the provided evidence forclassification.

A method of using the embeddings of using the above discriminative modelas a vectorised representation of the provided evidence for clustering.

A method of using the embeddings of using the above discriminative modelas a vectorised representation of the provided evidence forinterpretation of node relationships.”

While certain embodiments have been described, these embodiments havebeen presented by way of example only, and are not intended to limit thescope of the inventions. Indeed the novel methods and systems describedherein may be embodied in a variety of other forms; furthermore, variousomissions, substitutions and changes in the form of methods and systemsdescribed herein may be made without departing from the spirit of theinventions. The accompanying claims and their equivalents are intendedto cover such forms of modifications as would fall within the scope andspirit of the inventions.

1. A method of training a discriminative model to approximate the output of a probabilistic graphical model, comprising: receiving by the discriminative model samples from said probabilistic graphical model; and training the discriminative model using said samples, wherein some of the data of the samples has been masked to allow the deterministic model to produce data which is robust to the user failing to input at least one symptom.
 2. The method of claim 1, wherein the masking is based on a uniform distribution.
 3. The method of claim 1, wherein the discriminative model is a neural network.
 4. The method of claim 3, wherein the neural network is a neural network comprising a plurality of sub-networks that can approximate the outputs of the probabilistic graphical model.
 5. The method of claim 3, wherein the neural network is a single neural network that can approximate the outputs of the probabilistic graphical model.
 6. The method of claim 1, wherein the probabilistic graphical model is a noisy-OR model.
 7. The method of claim 1, wherein the probabilistic graphical model expresses probabilistic relationships between variables comprising diagnosis, symptoms and risk factors for medical diagnosis.
 8. A method of deriving a vectorised representation of evidence, the method comprising: training a discriminative model to approximate the output of a probabilistic graphical model, the probabilistic graphical model by: inputting into the discriminative model samples from said probabilistic graphical model; and training the discriminative model using said samples, the discriminative model being a neural net comprising a plurality of layers, wherein the input layer to the neural net comprises the evidence, the vectorised representation being extracted from an inner layer of the neural net.
 9. The method of claim 8, wherein the vectorised representation is extracted from the activations of an inner layer.
 10. The method of claim 8, wherein the probabilistic graphical model expresses probabilistic relationships between variables comprising diagnosis, symptoms and risk factors for medical diagnosis, wherein evidence is the presence or absence of symptoms and risk factors.
 11. The method of claim 8, wherein the vectorised representation is used for classification.
 12. The method of claim 8, wherein the vectorised representation is used for clustering.
 13. The method of claim 8, wherein the vectorised representation is used for interpretation of node relationships within the probabilistic graphical model.
 14. A non-transitory carrier medium comprising computer readable code configured to cause a computer to perform a method of training a discriminative model to approximate the output of a probabilistic graphical model, comprising: receiving by the discriminative model samples from said probabilistic graphical model; and training the discriminative model using said samples, wherein some of the data of the samples has been masked to allow the deterministic model to produce data which is robust to the user failing to input at least one symptom.
 15. A non-transitory carrier medium comprising computer readable code configured to cause a computer to perform a method of deriving a vectorised representation of evidence, the method comprising: training a discriminative model to approximate the output of a probabilistic graphical model, the probabilistic graphical model by: inputting into the discriminative model samples from said probabilistic graphical model; and training the discriminative model using said samples, the discriminative model being a neural net comprising a plurality of layers, wherein the input layer to the neural net comprises the evidence, the vectorised representation being extracted from an inner layer of the neural net.
 16. A system for training a discriminative model to approximate the output of a probabilistic graphical model, comprising a processor, said processor comprising a probabilistic graphical model and a discriminative model, wherein the processor is adapted to: receive by the discriminative model samples from said probabilistic graphical model; and train the discriminative model using said samples, wherein some of the data of the samples has been masked to allow the deterministic model to produce data which is robust to the used failing to input at least one symptom.
 17. The system of claim 16, wherein the processor comprises a graphical processing unit.
 18. The system of claim 16, wherein the masking is based on a uniform distribution.
 19. The system of claim 16, wherein the probabilistic graphical model expresses probabilistic relationships between variables comprising diagnosis, symptoms and risk factors for medical diagnosis.
 20. A system for deriving a vectorised representation of evidence, the system comprising a processor adapted to: train a discriminative model to approximate the output of a probabilistic graphical model, the probabilistic graphical model by: inputting into the discriminative model samples from said probabilistic graphical model; and training the discriminative model using said samples, the discriminative model being a neural net comprising a plurality of layers, wherein the input layer to the neural net comprises the evidence, the vectorised representation being extracted from an inner layer of the neural net. 